Tree-based Node Aggregation in Sparse Graphical Models

High-dimensional graphical models are often estimated using regularization that is aimed at reducing the number of edges in a network. In this work, we show how even simpler networks can be produced by aggregating the nodes of the graphical model. We develop a new convex regularized method, called the tree-aggregated graphical lasso or tag-lasso, that estimates graphical models that are both edge-sparse and node-aggregated. The aggregation is performed in a data-driven fashion by leveraging side information in the form of a tree that encodes node similarity and facilitates the interpretation of the resulting aggregated nodes. We provide an efficient implementation of the tag-lasso by using the locally adaptive alternating direction method of multipliers and illustrate our proposal’s practical advantages in simulation and in applications in finance and biology.


Introduction
Graphical models are greatly useful for understanding the relationships among large numbers of variables.Yet, estimating graphical models with many more parameters than observations is challenging, which has led to an active area of research on high-dimensional inverse covariance estimation.Numerous methods attempt to curb the curse of dimensionality through regularized estimation procedures (e.g., Meinshausen and Bühlmann, 2006;Yuan and Lin, 2007;Banerjee et al., 2008;Friedman et al., 2008;Rothman et al., 2008;Peng et al., 2009;Yuan, 2010;Cai et al., 2011Cai et al., , 2016)).Such methods aim for sparsity in the inverse covariance matrix, which corresponds to graphical models with only a small number of edges.A common method for estimating sparse graphical models is the graphical lasso (glasso) (Yuan and Lin, 2007;Banerjee et al., 2008;Rothman et al., 2008;Friedman et al., 2008), which adds an 1 -penalty to the negative log-likelihood of a sample of multivariate normal random variables.While this and many other methods focus on the edges for dimension reduction, far fewer contributions (e.g., Tan et al., 2015;Eisenach et al., 2020;Pircalabelu and Claeskens, 2020) focus on the nodes as a guiding principle for dimension reduction.
Nonetheless, node dimension reduction is becoming increasingly relevant in many areas where data are being measured at finer levels of granularity.For instance, in biology, modern high-throughput sequencing technologies provide low-cost microbiome data at high resolution; in neuroscience, brain activity in hundreds of regions of interest can be measured; in finance, data at the individual company level at short time scales are routinely analyzed; and in marketing, joint purchasing data on every stock-keeping-unit (product) is recorded.The fine-grained nature of this data brings new challenges.The sheer number of fine-grained, often noisy, variables makes it difficult to detect dependencies.Moreover, there can be a mismatch between the resolution of the measurement and the resolution at which natural meaningful interpretations can be made.The purpose of an analysis may be to draw conclusions about entities at a coarser level of resolution than happened to be measured.Because of this mismatch, practitioners are sometimes forced to devise ad hoc post-processing steps involving, for example, coloring the nodes based on some classification of them into groups q q q q q q q q q q q q q q q 1 2 3 True aggregated graph True aggregated Ω q q q q q q q q q q q q q q q 1 2 3 tag−lasso: aggregated graph tag−lasso: aggregated Ω q q q q q q q q q q q q q q q 1 2 3 in an attempt to make the structure of an estimated graphical model more interpretable and the domain-specific takeaways more apparent (e.g., Millington and Niranjan, 2019).
Our solution to this problem is to incorporate the side information about the relationship between nodes directly into the estimation procedure.In our framework, this side information is encoded as a tree whose leaves correspond to the measured variables.Such tree structures are readily available in many domains (e.g., taxonomies in biology and hierarchical classifications of jobs, companies, and products in business) and is well-suited to expressing multi-resolution structure that is present in many problems.We propose a new convex regularization procedure, called tag-lasso, which stands for tree-aggregated-graphical-lasso.
This procedure combines node (or variable) aggregation with edge-sparsity.The tree-based aggregation serves to both amplify the signal of similar, low-level variables and render a graphical model involving nodes at an appropriate level of scale to be relevant and interpretable.The edge-sparsity encourages the graphical model involving the aggregated nodes has a sparse network structure.
Our procedure is based on a tree-based parameterization strategy that translates the node aggregation problem into a sparse modeling problem, following an approach previously introduced in the regression setting (Yan and Bien, 2020).In Figure 1 (to be discussed more thoroughly in Section 4), we see that tag-lasso is able to recover the aggregated, sparse graph structure.By doing so, it yields a more accurate estimate of the true graph, and its output is easier to interpret than the full, noisy graph obtained by the glasso.
The rest of the paper is organized as follows.Section 2 introduces the tree-based parameterization structure for nodewise aggregation in graphical models.Section 3 introduces the tag-lasso estimator, formulated as a solution to a convex optimization problem, for which we derive an efficient algorithm.Section 4 presents the results of a simulation study.Section 5 illustrates the practical advantages of the tag-lasso on financial and microbiome data sets.

Node Aggregation in Penalized Graphical Models
Let S be the empirical covariance matrix based on n multivariate normal observations of dimension p, with mean vector µ and covariance matrix Σ.The target of estimation is the precision matrix Ω = Σ −1 , whose sparsity pattern provides the graph structure of the Gaussian graphical model, since Ω jk = 0 is equivalent to variables j and k being conditionally independent given all other variables.To estimate the precision matrix, it is common to use a convex penalization method of the form where tr(•) denotes the trace, P(•) is a convex penalty function, and λ > is a tuning parameter controlling the degree of penalization.Choosing the 1 -norm where Ω −diag contains the unique off-diagonal elements, yields the graphical lasso (glasso) (Friedman et al., 2008;Yuan and Lin, 2007;Banerjee et al., 2008;Rothman et al., 2008).It encourages Ω to be sparse, corresponding to a graphical model with few edges.
However, when Ω is not sparse, demanding sparsity in Ω may not be helpful, as we will show in Section 2.1.Such settings can arise when data are measured and analyzed at ever higher resolutions (a growing trend in many areas, see e.g.Callahan et al. 2017).A tree is a natural way to represent the different scales of data resolution, and we introduce a new choice for P that uses this tree to guide node aggregation, thereby allowing for a data adaptive choice of data scale for capturing dependencies.Such tree-based structures are available in many domains.For instance, companies can be aggregated according to hierarchical industry classification codes; products can be aggregated from brands towards product categories; brain voxels can be aggregated according to brain regions; microbiome data can be aggregated according to taxonomy.The resulting penalty function then encourages a more general and yet still highly interpretable structure for Ω.In the following subsection, we use a toy example to illustrate the power of such an approach.

Node Aggregation
Consider a toy example with p variables   where ε 1 , . . ., ε p are independent standard normal random variables.By construction, it is clear that there is a very simple relationship between the variables: The first two variables both depend on the sum of the other p − 2 variables.However, a standard graphical model on the p variables does not naturally express this simplicity.The first row of Table 1 shows the covariance and precision matrices for the full set of variables X 1 , . . ., X p .The graphical model on the full set of variables is extremely dense O(p 2 ) edges.Imagine if instead we could form a graphical model with only three variables: X 1 , X 2 , X, where the last variable X = p j=3 X j aggregates all but the first two variables.The bottom row of Table 1 results in a graphical model that matches the simplicity of the situation.

Nodes Covariance Matrix Precision
The lack of sparsity in the p-node graphical model means that the graphical lasso will not do well.Nonetheless, a method that could perform node aggregation would be able to yield a highly-interpretable aggregated sparse graphical model since X 1 and X 2 are conditionally independent given the aggregated variable X.
It is useful to map from the small aggregated graphical model to the original p-node graphical model.One does so by writing the precision matrix in "G-block" format (Bunea et al., 2020, although they introduce this terminology in the context of the covariance matrix, not its inverse) for a given partition G = {G 1 , ..., G K } of the nodes {1, . . ., p} and corresponding p×K membership matrix M, with entries M jk = 1 if j ∈ G k , and M jk = 0 otherwise.In particular, there exists a K ×K symmetric matrix C and a p×p diagonal matrix D such that the precision matrix can be written as Ω = MCM + D. The block-structure of Ω is captured by the first part of the decomposition, the aggregated K × K precision matrix on the set of aggregated nodes can then be written as

p} and
MCM has only three distinct rows/columns since the aggregated variables j = 3, . . ., p share all their entries.In the presence of node aggregation and edge sparsity, the graphical model corresponding to the aggregated precision matrix is far more parsimonious than the graphical model on the full precision matrix (see Table 1).As motivated by this example, our main goal is to estimate the precision matrix in such a way that we can navigate from a p-dimensional problem to a K-dimensional problem whose corresponding graphical model provides a simple description of the conditional dependency structure among K aggregates of the original variables.In the following proposition, we show that this can be accomplished by looking for a precision matrix that has a G-block structure.The proof of the proposition is included in Appendix A Proposition 2.1.Suppose X ∼ N p (0, Ω −1 ) with Ω = MCM + D, where M ∈ {0, 1} p×K is the membership matrix, D 0, and let X = M X ∈ R K be the vector of aggregated variables.Then X has precision matrix C + D agg , where D agg is a diagonal matrix, and therefore c ij = 0 is equivalent to the aggregates X i and X j being conditionally independent given all other aggregated variables.
While Proposition 2.1 gives us the desired interpretation in the graphical model with K aggregated nodes, in practice, the partition G, its size K, and corresponding membership matrix M are, however, unknown.Rather than considering arbitrary partitions of the variables, we constrain ourselves specifically to partitions guided by a known tree.In so doing, we allow ourselves to exploit side information and help ensure that the aggregated nodes will be easily interpretable.To this end, we introduce a tree-based parameterization strategy that allows us to embed the node dimension reduction into a convex optimization framework.

Tree-Based Parameterization
Our aggregation procedure assumes that we have, as side information, a tree that represents the closeness (or similarity) of variables.We introduce here a matrix-valued extension of the tree-based parameterization developed in Yan and Bien (2020) for the regression setting.We consider a tree T with p leaves Ω 1 , . . ., Ω p where Ω j denotes column 1 ≤ j ≤ p of Ω.We restrict ourselves to partitions that can be expressed as a collection of branches of T .Newly aggregated nodes are then formed by summing variables within branches.To this end, we assign a p-dimensional parameter vector γ u to each node u in the tree T (see Figure 2 for an example).Writing the set of nodes in the path from the root to the j th leaf (variable) as ancestor(j) ∪ {j}, we express each column/row in the precision matrix as where we sum over all the γ u 's along this path, and e j denotes the p-dimensional vector with all zeros except for its j th element that is equal to one.In the remainder, we will make extensive use of the more compact notation parameter matrix collecting the γ u 's in its rows and D is a diagonal parameter matrix with elements d 1 , . . ., d p .
By zeroing out γ u 's, certain nodes will be aggregated, as can be seen from the illustrative example in Figure 3.More precisely, let Z = {u : γ u = 0} denote the set of non-zero rows in Γ and let A Z be the sub-matrix of A where only the columns corresponding to the non-zeros rows in Γ are kept.The number of blocks K in the aggregated network is then given by the number of unique rows in A Z .The membership matrix M (Section 2.1), and hence the set γ 1:5 of aggregated nodes, can then be derived from the variables (rows) in the matrix A Z that share all their row-entries.
We are now ready to introduce the tag-lasso, which is based on this parameterization.

Tree Aggregated Graphical lasso
To achieve dimension reduction via node aggregation and edge sparsity simultaneously, we extend optimization problem (1) by incorporating the parameterization introduced above.
Our estimator, called the tag-lasso, is defined as with Γ −r 2,1 = u∈T −r γ u 2 and T −r being the set of all nodes in T other than the root.
This norm induces row-wise sparsity on all non-root rows of Γ.This row-wise sparsity, in turn, induces node aggregation as explained in Section 2.2.The root is excluded from this penalty term so that in the extreme of large λ 1 one gets complete aggregation but not necessarily sparsity (in this extreme, all off-diagonal elements of Ω are equal to the scalar γ that appears in the equality constraint involving γ r ).While λ 1 controls the degree of node aggregation, λ 2 controls the degree of edge sparsity.When λ 1 = 0, the optimization problem in (4) reduces to the glasso.
Finally, note that optimization problem (4) fits into the general formulation of penalized graphical models given in (1) since it can be equivalently expressed as where and P sparse (Ω) is the 1 -norm defined in (2).

Locally Adaptive Alternating Direction Method of Multipliers
We develop an alternating direction method of multipliers (ADMM) algorithm (Boyd et al., 2011), specifically tailored to solving (4).Our ADMM algorithm is based on solving this equivalent formulation of (4): min Additional copies of Ω and Γ are introduced to efficiently decouple the optimization problem.
Furthermore, we use an extension called locally adaptive-ADMM (LA-ADMM, Xu et al., 2017) with adaptive penalization to improve performance.The full details of the algorithm are provided in Appendix B.

Selection of the Tuning Parameters
To select the tuning parameters λ 1 and λ 2 , we form a 10 × 10 grid of (λ 1 , λ 2 ) values and find the pair that minimizes a 5-fold cross-validated likelihood-based score, 1 5 where Ω −F k is an estimate of the precision matrix trained while withholding the samples in the k th fold and S F k is the sample covariance matrix computed on the k th fold.In particular, we take Ω −F k to be a re-fitted version of our estimator (e.g., Belloni and Chernozhukov, 2013).After fitting the tag-lasso, we obtain Z = {u : γ u = 0}, the set of non-zero rows in Γ, which suggests a particular node aggregation; and P = {(i, j) : Ω ij = 0}, the set of non-zero elements in Ω, which suggests a particular edge sparsity structure.We then re-estimate Ω by maximizing the likelihood subject to these aggregation and sparsity constraints: min We solve this with an LA-ADMM algorithm similar to what is described in Section 3.1 and Appendix B.

Connections to Related Work
Combined forms of dimension reduction in graphical models can be found in, amongst others, Our work is most closely related to Pircalabelu and Claeskens (2020) who estimate a penalized graphical model and simultaneously classify nodes into communities.However, Pircalabelu and Claeskens (2020) do not use tree-based node-aggregation.Our approach, in contrast, considers the tree T as an important part of the problem to help determine the extent of node aggregation, and as a consequence the number of aggregated nodes (i.e. clusters, communities or blocks) K, in a data-driven way through guidance of the tree-based structure on the nodes.

Simulations
We investigate the advantages of jointly exploiting node aggregation and edge sparsity in graphical models.To this end, we compare the performance of the tag-lasso to two benchmarks: (i) oracle: The aggregated, sparse graphical model in ( 7) is estimated subject to the true aggregation and sparsity constraints.The oracle is only available for simulated data and serves as a "best case" benchmark.
(ii) glasso: This does not perform any aggregation (corresponding to the tag-lasso with λ 1 = 0).A sparse graph on the full set of variables is estimated.The glasso is computed using the same LA-ADMM algorithm as detailed in Appendix B. The tuning parameter is selected from a 10-dimensional grid as the value that minimizes the 5-fold cross-validation likelihood-based score in equation ( 6) with Ω −F k taken to be the glasso estimate.
All simulations were performed using the simulator package (Bien, 2016) in R (R Core Team, 2017).We evaluate the estimators in terms of three performance metrics: estimation accuracy, aggregation performance, and sparsity recovery.We evaluate estimation accuracy by averaging over many simulation runs the Kullback-Leibler (KL) distance where Σ = Ω −1 is the true covariance matrix.Note that the KL distance is zero if the estimated precision matrix equals the true precision matrix.
To evaluate aggregation performance, we use two measures: the Rand index (Rand, 1971) and the adjusted Rand index (Hubert and Arabie, 1985).Both indices measure the degree of similarity between the true partition on the set of nodes 1, . . ., p and the estimated partition.
The Rand index ranges from zero to one, where one means that both partitions are identical.
The adjusted Rand index performs a re-scaling to account for the fact that random chance will cause some variables to occupy the same group.
Finally, to evaluate sparsity recovery, we use the false positive and false negative rates The FPR reports the fraction of truly zero components of the precision matrix that are estimated as nonzero.The FNR gives the fraction of truly nonzero components of the precision matrix that are estimated as zero.

Simulation Designs
Data are drawn from a multivariate normal distribution with mean zero and covariance matrix Σ = Ω −1 .We take p = 15 variables and investigate the effect of increasing the number of variables in Section 4.3.We consider four different simulation designs, shown in unbalanced: aggregated graph q q q q q q q q q q q q q q q 1 2 3  Figure 5: A simple tree used for the "tag-lasso ideal" (left) and a more realistic tree used for the "tag-lasso realistic" (right).
of blocks.In the chain graph, adjacent aggregated groups are connected through an edge.
This structure corresponds to the motivating example of Section 1.In the random graph, one non-zero edge in the aggregated network is chosen at random.In the unbalanced graph, the clusters are of unequal size.In the unstructured graph, no aggregation is present.
Across all designs, we take the diagonal elements of Ω to be 1, the elements within a block of aggregated variables to be 0.5, and the non-zero elements across blocks to be 0.25.
We generate 100 different data sets for every simulation design and use a sample size of n = 120.The number of parameters (p + p(p − 1)/2 = 120) equals the sample size.
The tag-lasso estimator relies on the existence of a tree to perform node dimension q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q tag−lasso ideal tag−lasso realistic glasso tag−lasso ideal tag−lasso realistic glasso tag−lasso ideal tag−lasso realistic glasso tag−lasso ideal tag−lasso realistic reduction.We consider two different tree structures throughout the simulation study.First, we use an "ideal" tree which contains the true aggregation structure as the sole aggregation level between the leaves and the root of the tree.As an example, the true aggregation structure for the chain graph structure is shown in the left panel of Figure 5.We form A corresponding to this oracle tree to obtain the "tag-lasso ideal" estimator.
We also consider a more realistic tree, shown in the right panel of Figure 5, following a construction similar to that of Yan and Bien (2020).The tree is formed by performing hierarchical clustering of p latent points chosen to ensure that the tree contains the true aggregation structure and that these true clusters occur across a variety of depths.In particular, we generate K cluster means µ 1 , . . ., µ K with µ i = 1/i.We set the number of latent points associated with each of the K means equal to the cluster sizes from Figure 4.
These latent points are then drawn independently from N (µ i , [0.05 • min j (µ i − µ j )] 2 ).Finally, we form A corresponding to this tree to obtain the "tag-lasso realistic" estimator.
The tag-lasso ideal method performs nearly as well as the oracle.Comparing the tag-lasso realistic method to the tag-lasso ideal method suggests a minimal price paid for using a more realistic tree.
The "unstructured" panel of Figure 6 shows a case in which there is sparsity but no aggregation in the true data generating model.As expected, the glasso performs best in this case; however, we observe minimal cost to applying the tag-lasso approaches (which encompass the glasso as a special case when λ 1 = 0).
Aggregation Performance.Table 2 summarizes the aggregation performance of the three estimators in terms of the Rand index (RI) and adjusted Rand index (ARI).No results on the ARI in the unstructured simulation design are reported since it cannot be computed for a partition consisting of singletons.The tag-lasso estimators perform very well.If one can rely on an oracle tree, the tag-lasso perfectly recovers the aggregation structure, as reflected in the perfect (A)RI values of the tag-lasso ideal method.Even when the tag-lasso uses a more complex tree structure, it recovers the correct aggregation structure in the vast majority of cases.The glasso returns a partition of singletons as it is unable to perform dimension reduction through aggregation, as can be seen from its zero values on the ARI.Sparsity Recovery.Table 3 summarizes the results on sparsity recovery (FPR and FNR).
The tag-lasso estimators enjoy favorable FPR and FNR, mostly excluding the irrelevant conditional dependencies (as reflected by their low FPR) and including the relevant conditional dependencies (as reflected by their low FNR).In the simulation designs with aggregation, the glasso pays a big price for not being able to reduce dimensionality through aggregation, leading it to include too many irrelevant conditional dependencies, as reflected through its large FPRs.In the unstructured design, the rates of all estimators are, overall, low.

Increasing the Number of Nodes
We investigate the sensitivity of our results to an increasing number of variables p.We focus on the chain simulation design from Section 4.1 and subsequently double p from 15 to 30, 60 and 120 while keeping the number of blocks K fixed at three.The sample size n is set proportional to the complexity of the model, as measured by Kp+p.Hence, the sample sizes corresponding to the increasing values of p are respectively, n = 120, 240, 480, 960, thereby keeping the ratio of the sample size to the complexity fixed at two.In each setting, the number of parameters to be estimated is large, equal to 120, 465, 1830, 7260, respectively; thus increasing relative to the sample size.
The left panel of Figure 7 shows the mean KL distance (on a log-scale) of the four estimators as a function of p.As the number of nodes increases, the estimation accuracy of the tag-lasso estimators and the oracle increases slightly.For fixed K and increasing p, the aggregated nodes-which can be thought of as the average of p/K random variables-may be stabler, thereby explaining why the problem at hand does not get harder when increasing p q q q q Increasing number of nodes

Number of nodes
Mean KL distance 0.2 1 5 p=15 p=30 p=60 p=120 q tag−lasso ideal tag−lasso realistic glasso oracle q q q q Increasing number of blocks

Number of blocks
Mean KL distance for the methods with node aggregation.By contrast, the glasso-which is unable to exploit the aggregation structure-performs worse as p increases.For p = 120, for instance, the tag-lasso estimators outperform the glasso by a factor 50.
Results on aggregation performance and sparsity recovery are presented in Figure 12 of Appendix C. The tag-lasso ideal method perfectly recovers the aggregation structure for all values of p.The realistic tag-lasso's aggregation performance is close to perfect and remains relatively stable as p increases.The glasso is unable to detect the aggregation structure, as expected and reflected through its zero ARIs.The tag-lasso estimators also maintain a better balance between the FPR and FNR than the glasso.While their FPRs increase as p increases, their FNRs remain close to perfect, hence all relevant conditional dependencies are recovered.
The glasso, in contrast, fails to recover the majority of relevant conditional dependencies when p = 60, 120, thereby explaining its considerable drop in estimation accuracy.

Increasing the Number of Blocks
Finally, we investigate the effect of increasing the number of blocks K.We take the chain simulation design from Section 4.1 and increase the number of blocks from K = 3 to K = 5, 6, 10, while keeping the number of variables fixed at p = 30.The right panel of Figure 7 shows the mean KL distance (on a log-scale) of the four estimators as a function of K.As one would expect, the difference between the aggregation methods and the glasso decreases as K increases.However, for all K considered, the glasso does far less well than the aggregation based methods.
Similar conclusions hold in terms of aggregation and sparsity recovery performance.Detailed results are presented in Figure 13 of Appendix C. The tag-lasso ideal method performs as well as the oracle in terms of capturing the aggregation structure; the tag-lasso realistic method performs close to perfect and its aggregation performance improves with increasing K.In terms of sparsity recovery, the tag-lasso estimators hardly miss relevant conditional dependencies and only include a small number of irrelevant conditional dependencies.The glasso's sparsity recovery performance is overall worse but does improve with increasing K.

Financial Application
We demonstrate our method on a financial data set containing daily realized variances of Since the different observations of the consecutive days are (time)-dependent, we first fit the popular and simple heterogeneous autoregressive (HAR) model of (Corsi, 2009) to each of the individual log-transformed realized variance series.Graphical displays of the residual series of these 31 HAR models suggest that almost all autocorrelation in the series is captured.We then apply the tag-lasso to the residual series to learn the conditional dependency structure among stock market indices.
Estimated Graphical Model.We fit the tag-lasso estimator, with 5-fold cross validation to select tuning parameters, to the full data set, with the matrix A encoding the tree structure in Figure 8.The tag-lasso returns a solution with K = 6 aggregated blocks; the sparsity pattern of the full estimated precision matrix is shown in the top left panel of Figure 9.
The coloring of the row labels and the numbering of columns convey the memberships of each variable to aggregated blocks (to avoid clutter, only the first column of each block is labeled).
Dimension reduction mainly occurs through node aggregation, as can be seen from the aggregated precision matrix in the bottom right panel of Figure 9.The resulting aggregated graphical model is rather dense with only about half of the off-diagonal entries being non-zero in the estimated aggregated precision matrix, thereby suggesting strong volatility connect- q q q q q q q q q q −25 − edness.The solution returned by the tag-lasso estimator consists of one single-market block (block 5: Canada) and five multi-market blocks, which vary in size.The Australian, South-America, and all Asian stock markets form one aggregated block (block 6).Note that the tag-lasso has "aggregated" these merely because they have the same non-dependence structure (i.e.all of these markets are estimated to be conditionally independent of each other and all other markets).The remaining aggregated nodes concern the US market (block 4) and three European markets, which are divided into North-Europe (block 1), Central-, South-the latter two and the US play a central role as they are the most strongly connected nodes: These three nodes are connected to each other, the US node is additionally connected to Canada, whereas these European nodes are additionally connected with North-Europe.
Out-of-sample Performance.We conduct an out-of-sample exercise to compare the taglasso estimator to the glasso estimator.We take a random n = 203 observations (80% of the full data set) to form a "training sample" covariance matrix and use the remaining data to form a "test sample" covariance matrix S test , and repeat this procedure ten times.We fit both the tag-lasso and glasso estimator to the training covariance matrix, with 5-fold cross-validation on the training data to select tuning parameters.Next, we compute their corresponding out-of-sample errors on the test data, as in ( 6).
The top right panel of Figure 9 shows each of these ten test errors for both the tag-lasso (x-axis) and the glasso estimator (y-axis).The fact that in all ten replicates the points are well above the 45-degree line indicates that the tag-lasso estimator has better estimation error than the glasso.Tag-lasso has a lower test error than glasso in all ten replicates, resulting in a substantial reduction in test errors.This indicates that jointly exploiting edge and node dimension reduction is useful for precision matrix estimation in this context.

Microbiome Application
We next turn to a data set of gut microbial amplicon data in HIV patients (Rivera-Pinto et al., 2018), where our goal is to estimate an interpretable graphical model, capturing the interplay between different taxonomic groups of the microbiome.Bien et al. (2020) recently showed that tree-based aggregation in a supervised setting leads to parsimonious predictive models.The data set has n = 152 HIV patients, and we apply the tag-lasso estimator to all p = 104 bacterial operational taxonomic units (OTUs) that have non-zero counts in over half of the samples.We use the taxonomic tree that arranges the OTUs into natural hierarchical groupings of taxa: with 17 genera, 11 families, five orders, five classes, three phyla, and one kingdom (the root node).We employ a standard data transformation from the field of compositional data analysis (see e.g., Aitchison, 1982) called the centered logratio (clr) transformation that is commonly used in microbiome graphical modeling (Kurtz et al., 2015;Lo and Marculescu, 2018;Kurtz et al., 2019).After transformation, Kurtz et al. (2015) apply the glasso, Lo and Marculescu (2018) incorporate phylogenetic information into glasso's optimization problem through weights within the 1 -penalty, and Kurtz et al. (2019) estimate a latent graphical model which combines sparsity with a low-rank structure.
We instead, use the tag-lasso to learn a sparse aggregated network from the clr-transformed microbiome compositions.While the clr-transform induces dependence between otherwise independent components, Proposition 1 in Cao et al. (2019) provides intuition that as long as the underlying graphical model is sparse and p is large, these induced dependencies may have minimal effect on the covariance matrix.Future work could more carefully account for the induced dependence, incorporating ideas from Cao et al. (2019) or Kurtz et al. (2019).
Estimated Graphical Model.We fit the tag-lasso to the full data set and use 5-fold cross-validation to select the tuning parameters.The tag-lasso estimator provides a sparse aggregated graphical model with K = 28 aggregated blocks (a substantial reduction in nodes from the original p = 104 OTUs).The top panel of Figure 10 shows the sparsity pattern of the p × p estimated precision matrix (top left) and of the K × K estimated aggregated precision matrix (top right).A notable feature of the tag-lasso solution is that it returns a wide range of aggregation levels: The aggregated network consists of 17 OTUs, 7 nodes aggregated to the genus level (these nodes start with "g "), 3 to the family level (these nodes start with"f "), and 1 node to the kingdom level (this node starts with "k ").Some aggregated nodes, such as the "g Blautia" node (block 19), contain all OTUs within their taxa; some other aggregated nodes, indicated with an asterisk like the "k Bacteria*" node (block 28), have some of their OTUs missing.This latter "block" consists of 18 OTUs from across the phylogenetic tree that are estimated to be conditionally independent with all other OTUs in the data set.
While the tag-lasso determines the aggregation level in a data-driven way through cross   q q q q q q q q q q tag−lasso glasso q q q q q q q q q q Figure 11: Microbiome data.Left: Aggregated network estimated by the constrained CV version of the tag-lasso.The colour of the nodes is based on their level of aggregation (OTU: pink, genus: orange, family: blue); their width is proportional to the number of OTUs they aggregate.Middle: Network estimated by the glasso.Right: Test errors across the ten replications for the unconstrained (solid black) and constrained (unfilled blue) CV version of the tag-lasso versus the glasso.validation, practitioners or researchers may also sometimes wish to restrict the number of blocks K to a pre-determined level when such prior knowledge is available or if this is desirable for interpretability.As an illustration, we consider a constrained cross-validation scheme in which we restrict the number of blocks K to maximally ten and select the sparsity parameters with the best cross validated error among those solutions with K ≤ 10.The bottom panel of Figure 10 shows the sparsity pattern of the full and aggregated precision matrices estimated by this constrained version of the tag-lasso.
The resulting network consists of K = 8 aggregated nodes.The "k Bacteria*" node now aggregates 78 OTUs that are estimated to be conditionally independent with each other and all others.The interactions among the remaining nodes are shown in the left panel of Figure 11, which consists of three OTUs (OTU134, OTU156, and OTU161, in pink), three genera (Prevotella, Bacteroides, and Alistipes in orange) and one family (Porphyromonadaceae in blue).The resulting network is much simpler than the one estimated by the glasso, shown in the middle panel of Figure 11.The glasso finds 58 OTUs to be conditionally independent with all others, but the interactions among the remaining 46 OTUs are much more difficult to interpret.The glasso is limited to working at the OTU-level, which prevents it from providing insights about interactions that span different levels of the taxonomy.
Out-of-sample Performance.We conduct the same out-of-sample exercise as described in Section 5.1.The right panel of Figure 11 presents the ten test errors (black dots) for the unconstrained CV tag-lasso and glasso.In all but one case, the tag-lasso leads to a better fit than the glasso, suggesting that it is better suited for modeling the conditional dependencies among the OTUs.The unfilled blue dots show the same but for the constrained CV tag-lasso.In all ten cases, it underperforms the unconstrained CV tag-lasso (see shift to the right on the horizontal axis); however, its performance is on a par with the glasso, with test errors close to the 45 degree line.Thus, there does not appear to be a cost in out-of-sample-performance to the interpretability gains of the constrained tag-lasso over the glasso.

Conclusion
Detecting conditional dependencies between variables, as represented in a graphical model, forms a cornerstone of multivariate data analysis.However, graphical models, characterized by a set of nodes and edges, can quickly explode in dimensionality due to ever-increasing fine-grained levels of resolution at which data are measured.In many applications, a tree is available that organizes the measured variables into various meaningful levels of resolution.
In this work, we introduce the tag-lasso, a novel estimation procedure for graphical models that curbs this curse of dimensionality through joint node and edge dimension reduction by leveraging this tree as side information.Node dimension reduction is achieved by a penalty that allows nodes to be aggregated according to the tree structure; edge dimension reduction is achieved through a standard sparsity-inducing penalty.As such, the tag-lasso generalizes the popular glasso approach to sparse graphical modelling.An R package called taglasso implements the proposed method and is available on the GitHub page of the first author.

Appendices
A Proof of Proposition 2.1 Proof.First, note that X follows a K-dimensional multivariate normal distribution with mean zero and covariance matrix M (D + MCM ) −1 M. Next, we re-write this covariance matrix by two successive applications of equation ( 23) in Henderson and Searle (1981): Hence, the precision matrix of X is given by (M D −1 M) −1 + C. Now since (M D −1 M) −1 is diagonal, c ij = 0 ⇔ X i ⊥ X j | X −{i,j} , for any i, j = 1, . . ., K and with X −{i,j} containing all aggregated variables expect for aggregate i and j.
Step (i) decouples into four independent problems, whose solutions are worked out in  (14)

Figure 1 :
Figure 1: Top: True full graph and precision matrix Ω with corresponding aggregated graph and precision matrix.Middle: Estimation output of the tag-lasso.Bottom: Estimation output of the glasso.
1 d denote a d-dimensional column vector of ones, and I d be the d × d identity matrix.

Figure 2 :Figure 3 :
Figure 2: An example of a tree T encoding similarity among p = 5 variables.
Chandrasekaran et al. (2012);Tan et al. (2015);Eisenach et al. (2020);Brownlees et al. (2020);Pircalabelu and Claeskens (2020).Chandrasekaran et al. (2012) consider a blend of principal component analysis with graphical modeling by combining sparsity with a low-rank structure.Tan et al. (2015) and Eisenach et al. (2020) both propose two-step procedures that first cluster variables in an initial dimension reduction step and subsequently estimate a cluster-based graphical model.Brownlees et al. (2020) introduce partial correlation network models with community structures but rely on the sample covariance matrix of the observations to perform spectral clustering.Our procedure differs from these works by introducing a single convex optimization problem that simultaneously induces aggregation and edge sparsity for the precision matrix.

Figure 4 ,
Figure 4, each having a different combination of aggregation and sparsity structures for the precision matrix Ω.Aggregation is present in the first three structures.The precision matrix has a G-block structure with K = 3 blocks.In Section 4.4, we investigate the effect of varying the number

Figure 4 :
Figure 4: Four aggregation designs: chain, random, unbalanced and unstructured graphs with corresponding precision matrix (top) and graph on the set of aggregated nodes (bottom).

Figure 6 :
Figure 6: Estimation accuracy of the tree estimators relative to the oracle.

Figure 7 :
Figure 7: Estimation accuracy of the four estimators (on a log-scale) for increasing number of variables p (and fixed K = 3, left panel) the number of blocks K (and fixed p = 30, right panel).

pFigure 8 :
Figure 8: Geography-based tree for the stock market data, which aggregates the p = 31 stock market indices (leaves) over several sub-continents towards a single root.Leaves, which represent individual stock markets, are displayed horizontally.

Figure 9 :
Figure 9: Stock market indices data.Top left: Sparsity pattern (non-zeros in black) of full Ω with aggregation structure conveyed through row label coloring and column numbering.Top right: Test errors across the ten replications (dots) for the tag-lasso versus glasso.Bottom: Aggregated graph for the K = 6 nodes obtained with the tag-lasso as an adjacency matrix (bottom left) and as a network (bottom right) with the size of each node proportional to the number of original variables it aggregates.

Figure 10 :
Figure10: Microbiome data.Full precision matrix (left) and aggregated precision matrix (right) estimated by the tag-lasso with an unconstrained five-fold cross-validation (top) and with a cross-validation subject to the constraint that there are at most ten blocks (bottom).

Table 1 :
Toy example: Covariance and precision matrices with corresponding graphical model (drawn for p = 50) for the full (top) and aggregated (bottom) set of nodes.

Table 2 :
Aggregation performance of the three estimators, as measured by the Rand index (RI) and adjusted Rand index (ARI), for the four simulation designs.Standard errors are in parentheses.

Table 3 :
Sparsity recovery of the three estimators, as measured by the false positive rate (FPR) and false negative rate (FNR), for the four simulation designs.Standard errors are in parentheses.